For many networks of scientific interest we know both the connections of the network and information
about the network nodes, such as the age or gender of individuals in a social network, geographic
location of nodes in the Internet, or cellular function of nodes in a gene regulatory network. Here 
we demonstrate how this “metadata” can be used to improve our analysis and understanding of 
network structure. We focus in particular on the problem of community detection in networks and 
develop a mathematically principled approach that combines a network and its metadata to detect 
communities more accurately than can be done with either alone. Crucially, the method does not 
assume that the metadata are correlated with the communities we are trying to find. Instead the 
method learns whether a correlation exists and correctly uses or ignores the metadata depending on 
whether they contain useful information. The learned correlations are also of interest in their own 
right, allowing us to make predictions about the community membership of nodes whose network 
connections are unknown. We demonstrate our method on synthetic networks with known structure
and on real-world networks, large and small, drawn from social, biological, and technological
domains. 


I. INTRODUCTION 

Networks arise in many fields and provide a powerful
and compact representation of the internal structure of
a wide range of complex systems [1]. Examples include
social networks of interactions among people,
technological and information networks such as the
Internet or the World Wide Web, and biological networks
of molecules, cells, or entire species. The last two
decades have witnessed rapid growth both in the
availability of network data and in the number and
sophistication of network analysis techniques. Borrowing
ideas from graph theory, statistical physics, computer
science, statistics, and other areas, network analysis
typically aims to characterize a network’s structural
features in a way that sheds light on the behavior of
the system the network describes. Studies of social
networks, for instance, might identify the most
influential or central individuals in a population.
Studies of road networks can shed light on traffic flows
or bottlenecks within a city or country. Studies of
pathways in metabolic networks can lead to a more
complete understanding of the molecular machinery of
the cell.

Most research in this area treats networks as objects of
pure topology, unadorned sets of nodes and their
interactions. Most network data, however, are accompanied
by annotations or metadata that describe properties of
nodes such as a person’s age, gender, or ethnicity in a
social network, feeding mode or body mass of species in
a food web, data capacity or physical location of nodes
on the Internet, and so forth. (There can be metadata
on the edges of a network as well as on the nodes [?],
but our focus here is on the node case.) In this paper,
we consider how to extend the analysis of networks to
directly incorporate such metadata. Our approach is based
on methods of statistical inference and can in principle
be applied to a range of different network analysis tasks.
Here, we focus specifically on one of the most widely
studied tasks, the community detection problem.
Community detection, also called node clustering or
classification, searches for a good division of a
network’s nodes into groups or classes [?]. Typically, one
searches for assortative structure, groupings of nodes
such that connections are denser within groups than
between them. This structure is common in social
networks, for example, where groups may correspond to
sets of friends or coworkers, but it also occurs in other
cases, including biological and ecological networks, the
Web, transportation and distribution networks, and
others. Less common, but no less important, is
disassortative structure, in which network connections
are sparser within groups than between them, and
mixtures of assortative and disassortative structure can
also occur, where different groups may have varying
propensities for within- or between-group connections.

In many cases, the groups identified by community
detection correlate meaningfully with other network
properties or functions, such as allegiances or personal
interests in social networks [?] or biological function in
metabolic networks [?]. Some recent research, however,
has suggested that these cases may be the exception
rather than the rule [?, ?], an important point that we
address later in this paper.

A large number of methods have been proposed
for detecting communities in unannotated networks [?].
Among these, some of the most powerful, both in terms
of rigorously provable performance and of raw speed, are
those based on statistical inference. Here we build on
these methods to incorporate node metadata into the
community detection problem in a principled and flexible
manner. The resulting methods have several attractive
features. First, they can make use of metadata in
arbitrary format to improve the accuracy of community
detection. Second, and crucially for our goals, they do
not assume a priori that the metadata correlate with the
communities we seek to find. Instead, they detect and
quantify the relationship between metadata and
community, if one exists, then exploit that relationship
to improve the results. Even if the correlation is
imperfect or noisy, the method can still use what
information is present to return improved results.
Conversely, if no correlation exists the method will
automatically ignore the metadata, returning results
based on network structure alone.

Third, our methods allow us to select between competing
divisions of a network. Many networks have a
number of different possible divisions [?]. For example,
a social network of acquaintances may have meaningful
divisions along lines of age, gender, race, religion,
language, politics, or many other variables. By
incorporating metadata that correlate with a particular
division of interest, we can favor that division over
others, steering the analysis in a desired direction.
(Approaches like this are sometimes referred to as
“supervised learning” techniques, particularly in the
statistics and machine learning literature.) Thus, if we
are interested for instance in a division of a social
network along lines of age, and we have age data for some
fraction of the nodes, we can use those data to steer the
algorithm toward age-correlated divisions. Even if the
metadata are incomplete or noisy, the algorithm can still
use them to guide its analysis. However, if we hand the
algorithm metadata that do not correlate with any good
division of the network, the method will decline to
follow along blindly, and will inform us that no good
correlation exists.

Finally, the correlation between metadata and network
structure learned by the algorithm (if one exists) is
interesting in its own right. Once found, it allows us to
quantify the agreement between network communities
and metadata, and to predict community membership
for nodes for which we lack network data and have only
metadata. If we have learned, for example, that age is
a good predictor of social groupings, then we can make
quantitative predictions of group membership for
individuals about whom we know their age and nothing else.

A number of other researchers have investigated ways
to incorporate metadata into network analysis, though
they have typically made stronger assumptions about the
relationship between metadata and communities [?-?].
Perhaps closest to our approach are the semi-supervised
learning methods [?-?], which treat the case where we
are given the exact community assignments of some
fraction of the nodes and the goal is to deduce the
remainder. A variant of this approach is active learning,
in which the community membership of some nodes is
given, but the known nodes are not specified a priori,
being instead chosen by the algorithm itself as it
runs [?, ?]. Another vein of research, somewhat further
from our approach, considers the case where we are told
some pairs of nodes that either definitely are or
definitely are not in the same community, and then
assigns communities subject to these constraints.

In the following sections we describe our method in
detail, and apply it to a selection of example networks.
We show that it recovers known communities in benchmark
data sets with higher accuracy than algorithms based on
network structure alone, that we can select between
competing community divisions in both real and synthetic
tests, that the method is able accurately to divine
correlations between network structure and metadata, or
determine that no such correlation exists, and that
learned correlations between structure and metadata can
be used to predict community membership based on
metadata alone.


II. METHODS 

Our method makes use of techniques of Bayesian
statistical inference. We construct a generative network
model possessing the specific features we hope to find
in our data, namely community structure and a
correlation between that structure and node metadata. We
then fit the model to an observed network plus
accompanying metadata, and the parameters of the fit
tell us about the structure of the network.

The model we use is a modified version of a stochastic
block model. The original stochastic block model,
proposed in 1983 by Holland et al. [?], is a simple model
for generating random networks with community structure
in which nodes are divided among some number of
communities and edges are placed randomly and
independently between them with probabilities that
depend only on the communities to which the nodes
belong. We modify this model in two ways. First,
following [21], we note that the standard stochastic
block model does poorly at mimicking the structure of
networks with highly heterogeneous degree sequences
(which includes nearly all real-world networks), and so
we include a “degree-correction” term that matches node
degrees (i.e., the number of connections each node has)
to those of the observed data. Second, we introduce a
dependence on node metadata via a set of prior
probabilities. The prior probability of a node belonging
to a particular community becomes a function of the
metadata, and it is this function that is learned by our
algorithm in order to incorporate the metadata into the
calculation.

Consider an undirected network with n nodes labeled
by integers $u = 1 \ldots n$, divided among k communities,
and denote the community to which node u belongs
by $s_u \in 1 \ldots k$. In the simplest case, we consider
metadata with a finite number K of discrete, unordered
values and we denote node u’s metadata by
$x_u \in 1 \ldots K$. The choice of labels $1 \ldots K$ is
arbitrary and does not imply an ordering for the
metadata or that the metadata are one-dimensional. If a
social network has two-dimensional metadata describing
both language and race, for example, we simply encode
each possible language/race combination as a different
value of x: English/white, Spanish/white, English/black,
and so forth. If a network has nodes that are missing
metadata values, we just let “missing” be another
metadata value.
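As an illustration of this encoding, the following minimal sketch maps multi-field metadata records to a single discrete label per node. The function name `encode_metadata`, the record layout, and the field names are hypothetical choices made for the example, not part of the method itself.

```python
def encode_metadata(records, fields):
    """Map each node's tuple of field values to an integer label 1..K.

    A value of None is kept as an ordinary category, so "missing"
    simply becomes one more metadata value.
    """
    codes = {}    # tuple of field values -> integer label
    labels = []
    for rec in records:
        key = tuple(rec.get(f) for f in fields)
        if key not in codes:
            codes[key] = len(codes) + 1   # labels are arbitrary, 1-based
        labels.append(codes[key])
    return labels, codes


# Hypothetical records for five nodes; the last one has no metadata.
records = [
    {"language": "English", "race": "white"},
    {"language": "Spanish", "race": "white"},
    {"language": "English", "race": "black"},
    {"language": "English", "race": "white"},
    {"language": None, "race": None},
]
x, codes = encode_metadata(records, ["language", "race"])
K = len(codes)
print(x, K)   # [1, 2, 3, 1, 4] 4
```

The integer labels carry no ordering; any one-to-one relabeling would serve equally well.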

Given metadata $x = \{x_u\}$ and degrees $d = \{d_u\}$ for all
nodes, a network is generated from the model as follows.
First, each node u is assigned to a community s with a
probability depending on u’s metadata $x_u$. We denote
the probability of assignment by $\gamma_{sx}$, for each
combination s, x of community and metadata, so the full
prior probability on community assignments is
$P(\mathbf{s}|\Gamma,x) = \prod_i \gamma_{s_i,x_i}$,
where $\Gamma$ denotes the $k \times K$ matrix of parameters
$\gamma_{sx}$. (More complex forms of the prior are
appropriate in other cases, as we will see.) Once every
node has been assigned to a community, edges are placed
independently at random between nodes, with the
probability of an edge between nodes u and v being

$$ p_{uv} = d_u d_v \theta_{s_u s_v}, \tag{1} $$

where $\theta_{st}$ are parameters that we specify, with
$\theta_{st} = \theta_{ts}$. The factor $d_u d_v$ allows the model
to fit arbitrary degree sequences as described above.
Models of this kind have been found to fit community
structure in real networks well [21].
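The generative process just described can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `generate_network` is hypothetical, labels are 0-based rather than 1-based, the edge probability is taken to be the product of the two degrees and the block parameter for the two nodes' communities, and that product is clipped to [0, 1] so that it is always a valid probability.

```python
import numpy as np


def generate_network(x, d, gamma, theta, rng):
    """Draw community labels and an adjacency matrix from the model.

    x     : length-n integer array of metadata values in 0..K-1
    d     : length-n array of degree parameters
    gamma : n_groups x K matrix, gamma[s, x] = prior prob. of community s
    theta : n_groups x n_groups symmetric matrix of edge parameters
    """
    n, n_groups = len(x), gamma.shape[0]
    # Step 1: assign each node to a community using its metadata column.
    s = np.array([rng.choice(n_groups, p=gamma[:, xu]) for xu in x])
    # Step 2: edge probabilities d_u * d_v * theta[s_u, s_v], clipped.
    p = np.clip(np.outer(d, d) * theta[np.ix_(s, s)], 0.0, 1.0)
    # Step 3: draw each pair u < v independently, then symmetrize.
    upper = np.triu(rng.random((n, n)) < p, k=1)
    A = (upper | upper.T).astype(int)
    return s, A


rng = np.random.default_rng(0)
n = 200
x = rng.integers(0, 2, size=n)              # two metadata values
d = np.full(n, 8.0)                         # equal degree parameters
gamma = np.array([[0.9, 0.1],
                  [0.1, 0.9]])              # each column sums to 1
c_in, c_out = 14.0, 2.0
# Scaled so that within-group pairs connect with probability c_in / n.
theta = np.array([[c_in, c_out],
                  [c_out, c_in]]) / (n * 8.0 * 8.0)
s, A = generate_network(x, d, gamma, theta, rng)
```

With these illustrative parameters the metadata agree with the planted community for roughly 90% of nodes, and within-group pairs are seven times more likely to be connected than between-group pairs.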

Community detection then consists of fitting the model
to observed network data using the method of maximum
likelihood. Given an observed network, we define its
adjacency matrix A to be the $n \times n$ real symmetric
matrix with elements $a_{uv} = 1$ if there is an edge between
nodes u and v and 0 otherwise. Then the probability, or
likelihood, that this network was generated by our model,
given the parameters and metadata, is

$$ P(A|\Theta,\Gamma,x) = \sum_{\mathbf{s}} P(A|\Theta,\mathbf{s}) P(\mathbf{s}|\Gamma,x), \tag{2} $$

where $\Theta$ is the $k \times k$ matrix with elements $\theta_{st}$
and the sum is over all possible community assignments
$\mathbf{s}$.
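Because edges are placed independently, the two factors inside the sum can be written out explicitly. The following is a sketch, assuming each pair of nodes u < v is an independent Bernoulli trial with the edge probability $p_{uv}$ defined by the model and that nodes are assigned to communities independently given their metadata:

```latex
P(A|\Theta,\mathbf{s}) = \prod_{u<v} p_{uv}^{a_{uv}} \, (1 - p_{uv})^{1 - a_{uv}},
\qquad
P(\mathbf{s}|\Gamma,x) = \prod_{u} \gamma_{s_u,x_u}.
```

Under these assumptions the likelihood is a sum of $k^n$ such products, one for each possible assignment of the n nodes to the k communities.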

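Fitting the model involves maximizing this likelihood
with respect to $\Theta$ and $\Gamma$ to determine the most
likely values of the parameters, which we do using an
expectation-maximization (EM) algorithm. A full
derivation of the algorithm is given in the Appendix,
but the central result is that the optimal parameter
values are

$$ \theta_{st} = \frac{\sum_{uv} a_{uv} q^{uv}_{st}}{\sum_u d_u q^u_s \sum_v d_v q^v_t}, \tag{3} $$

$$ \gamma_{sx} = \frac{\sum_u \delta_{x,x_u} q^u_s}{\sum_u \delta_{x,x_u}}, \tag{4} $$

where $\delta_{xy}$ is the Kronecker delta and

$$ q^u_s = \sum_{\mathbf{s}} q(\mathbf{s}) \, \delta_{s_u,s}, \qquad q^{uv}_{st} = \sum_{\mathbf{s}} q(\mathbf{s}) \, \delta_{s_u,s} \delta_{s_v,t}, $$

with

$$ q(\mathbf{s}) = \frac{P(A|\Theta,\mathbf{s}) P(\mathbf{s}|\Gamma,x)}{\sum_{\mathbf{s}} P(A|\Theta,\mathbf{s}) P(\mathbf{s}|\Gamma,x)} = P(\mathbf{s}|A,\Theta,\Gamma,x). \tag{5} $$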


Physically, $q^u_s$ is the marginal posterior probability
that node u belongs to community s and $q^{uv}_{st}$ is the
joint probability that nodes u and v belong to s and t
respectively. Normally, in fact, $q^u_s$ is the object of
primary interest in the calculation, as it tells us to
which group each node belongs, i.e., it tells us the
optimal division of the network into communities. The
prior probabilities $\gamma_{sx}$ are also of interest, since
they tell us how and to what extent the metadata are
correlated with the communities, a point discussed
further in Section III.
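To make the parameter-update step concrete, the sketch below computes new parameter values from given one-node and two-node marginals. It is an illustration under stated assumptions rather than the authors' implementation: the block parameter for groups s and t is assumed to be the expected number of edges between them divided by the product of the expected total degrees of the two groups, and the prior for community s and metadata value x is assumed to be the average marginal probability of s over nodes carrying value x. The function name `m_step` and the array layout are hypothetical.

```python
import numpy as np


def m_step(A, d, x, q1, q2, K):
    """One parameter update from posterior marginals (assumed forms).

    A  : n x n symmetric 0/1 adjacency matrix
    d  : length-n array of node degrees
    x  : length-n integer array of metadata values in 0..K-1
    q1 : n x k array, q1[u, s] = P(node u is in community s)
    q2 : n x n x k x k array, q2[u, v, s, t] = P(u in s and v in t)
    """
    # Expected number of edges running between groups s and t.
    num = np.einsum("uv,uvst->st", A, q2)
    # w[s] = sum_u d_u q1[u, s], the expected total degree of group s.
    w = d @ q1
    theta = num / np.outer(w, w)
    # Average q1[u, s] over the nodes that carry each metadata value.
    onehot = np.eye(K)[x]            # n x K indicator matrix
    counts = onehot.sum(axis=0)      # nodes per metadata value (must be > 0)
    gamma = (q1.T @ onehot) / counts
    return theta, gamma


# Toy example: a path 0-1-2-3 with a hard split {0, 1} versus {2, 3}.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
d = A.sum(axis=1).astype(float)
x = np.array([0, 0, 1, 1])
q1 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
q2 = np.einsum("us,vt->uvst", q1, q1)   # independent marginals, for the toy only
theta, gamma = m_step(A, d, x, q1, q2, K=2)
```

In a full EM iteration the marginals `q1` and `q2` would come from the posterior distribution over assignments rather than being fixed by hand as they are here.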

Computationally, the most demanding part of the EM
algorithm is calculating the sum in the denominator of
Eq. (5), which has an exponentially large number of
terms, making its direct evaluation intractable on all but
the smallest of networks. Traditionally one gets around
this problem by approximating the full distribution
$q(\mathbf{s})$ by Monte Carlo importance sampling. Here, we
instead use a recently proposed alternative method based
on belief propagation [22], which is significantly
faster, and fast enough in practice for applications to
very large networks. (In separate work we have
successfully applied the method of this paper to a
network of over 1.4 million nodes.)
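The exponential cost is easy to see in a brute-force version of the posterior calculation, which is feasible only for toy networks. The sketch below enumerates every assignment of n nodes to the groups; it is not the belief propagation method used in the paper. The function name `exact_posterior`, the 0-based labels, and the Bernoulli form of the edge likelihood are assumptions made for illustration.

```python
import itertools

import numpy as np


def exact_posterior(A, d, x, theta, gamma):
    """Enumerate all assignments s and return a dict {s: q(s)}."""
    n, n_groups = len(d), theta.shape[0]
    iu = np.triu_indices(n, k=1)               # index pairs with u < v
    weights = {}
    for s in itertools.product(range(n_groups), repeat=n):
        s_arr = np.array(s)
        p = np.outer(d, d) * theta[np.ix_(s_arr, s_arr)]
        p_pairs, a_pairs = p[iu], A[iu]
        # Likelihood of the observed edges and non-edges given s.
        like = np.prod(np.where(a_pairs == 1, p_pairs, 1.0 - p_pairs))
        # Prior probability of s given the metadata.
        prior = np.prod(gamma[s_arr, x])
        weights[s] = like * prior
    total = sum(weights.values())              # n_groups ** n terms
    return {s: w / total for s, w in weights.items()}


# Toy example: a path 0-1-2-3 whose metadata favor the split {0,1} | {2,3}.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
d = A.sum(axis=1).astype(float)
x = np.array([0, 0, 1, 1])
theta = np.array([[0.2, 0.02], [0.02, 0.2]])
gamma = np.array([[0.8, 0.2], [0.2, 0.8]])
q = exact_posterior(A, d, x, theta, gamma)
best = max(q, key=q.get)
```

Even at n = 4 and two groups the sum has 16 terms; at n = 50 it would have more than 10^15, which is why sampling or message-passing approximations are needed in practice.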

We also consider cases in which the metadata are
ordered and potentially continuous variables, such as age
or income in a social network, which require a different
algorithm. The prior probability P(s | x) of belonging to
community s given metadata value x becomes a
continuous function of x, which we write as an expansion
in a set of basis functions $B_j(x)$, parametrized by the
coefficients of the expansion:
$P(s|x) = \sum_{j=0}^{N} \gamma_{sj} B_j(x)$.
The result for the optimal value of $\theta_{st}$ is still
given by Eq. (3), but the optimal values of the
$\gamma_{sj}$ are given by the solution of the equations

$$ \gamma_{sj} = \frac{\sum_u Q^u_{sj}}{\sum_{tu} Q^u_{tj}}, \qquad Q^u_{sj} = \frac{q^u_s \gamma_{sj} B_j(x_u)}{\sum_k \gamma_{sk} B_k(x_u)}. $$

A full derivation is given in the Appendix. These
equations can be conveniently and rapidly solved by
simple iteration, starting with the current best estimate
of $\gamma_{sj}$ and alternating between the equations until
convergence is achieved. In our implementation, we use
Bernstein polynomials as the basis functions, although
other choices are possible.
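A minimal sketch of a Bernstein-polynomial prior and of the simple iteration mentioned above follows. It assumes the metadata have been rescaled to the interval [0, 1], that the coefficients for each basis function sum to one over communities, and that the iteration alternates between computing per-node responsibilities for each basis function and renormalizing the coefficients. The names `bernstein_basis`, `prior`, and `update_gamma` are hypothetical.

```python
from math import comb

import numpy as np


def bernstein_basis(x, N):
    """Values of the N + 1 Bernstein polynomials of degree N at x in [0, 1]."""
    j = np.arange(N + 1)
    coeff = np.array([comb(N, int(jj)) for jj in j], dtype=float)
    return coeff * x**j * (1.0 - x) ** (N - j)


def prior(x, gamma):
    """P(s | x) as a weighted sum of basis functions, one row of gamma per s."""
    N = gamma.shape[1] - 1
    return gamma @ bernstein_basis(x, N)


def update_gamma(q1, xs, gamma, iters=50):
    """Iterate the coefficient update for fixed node marginals q1 (n x k)."""
    N = gamma.shape[1] - 1
    B = np.array([bernstein_basis(x, N) for x in xs])       # n x (N + 1)
    for _ in range(iters):
        denom = B @ gamma.T                                  # n x k
        # Responsibility of basis function j for node u in community s.
        Q = (q1[:, :, None] * gamma[None, :, :] * B[:, None, :]
             / denom[:, :, None])
        # Renormalize so the coefficients for each j sum to one over s.
        gamma = Q.sum(axis=0) / Q.sum(axis=(0, 1))
    return gamma


# Toy example: low-x nodes lean toward community 0, high-x toward 1.
xs = np.linspace(0.0, 1.0, 11)
q1 = np.where(xs[:, None] < 0.5, [0.9, 0.1], [0.1, 0.9])
gamma0 = np.full((2, 4), 0.5)          # two communities, degree N = 3
gamma_fit = update_gamma(q1, xs, gamma0)
```

Because the Bernstein polynomials of a given degree sum to one at every point of [0, 1], normalizing the coefficients over communities is enough to make `prior(x, gamma)` a valid probability distribution for every x.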


III. RESULTS 

We have applied the method to a range of example
networks, including computer-generated benchmarks that
test its ability to detect known structure, as well as a
variety of real-world networks.


A. Synthetic networks 

Our first tests are on computer-generated (“synthetic”)
networks that have known community structure
embedded within them. These networks were created
using the standard stochastic block model, in which nodes
are assigned to groups, then edges are placed between
them independently with probabilities that are a function
of group membership only [20, ?]. After the networks
are created, we generate discrete-valued node metadata
at random that match the true community assignments
of nodes a given fraction of the time and are chosen
randomly from the non-matching values otherwise. This
allows us to control the extent to which the metadata
correlate with the community structure and hence test the
algorithm’s ability to make use of metadata of varying
quality.
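The metadata-generation step can be sketched as follows. This is an illustrative reading of the procedure, assuming that each node independently keeps its true label with the stated probability and otherwise receives one of the other labels uniformly at random; the function name `noisy_metadata` is hypothetical.

```python
import numpy as np


def noisy_metadata(s, k, match_fraction, rng):
    """Return metadata that equal the true label with prob. match_fraction."""
    s = np.asarray(s)
    x = s.copy()
    flip = rng.random(len(s)) >= match_fraction
    # Adding an offset in 1..k-1 modulo k always gives a different label.
    offsets = rng.integers(1, k, size=flip.sum())
    x[flip] = (s[flip] + offsets) % k
    return x


rng = np.random.default_rng(1)
s_true = rng.integers(0, 2, size=10_000)     # two planted communities
x_meta = noisy_metadata(s_true, k=2, match_fraction=0.7, rng=rng)
observed_match = float(np.mean(x_meta == s_true))
```

For two communities, a match fraction of 0.5 yields metadata that carry no information about the planted structure, while 1.0 reproduces the communities exactly.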

Figure 1a shows results for a set of such networks with
two communities of equal size, with edge probabilities
$p_\mathrm{in} = c_\mathrm{in}/n$ and $p_\mathrm{out} = c_\mathrm{out}/n$ for
within-group and between-group edges, respectively. When
$c_\mathrm{in}$ is much greater than $c_\mathrm{out}$ the communities
are easy to detect from network structure alone, but as
$c_\mathrm{in}$ approaches $c_\mathrm{out}$ the structure becomes
weaker and harder to detect. Each curve in the figure
shows the fraction of nodes that are classified into
their correct groups by our algorithm, as we vary the
strength of the community structure, measured by the
difference $c_\mathrm{in} - c_\mathrm{out}$. Individual curves show
results for different levels of correlation between
communities and metadata.

When metadata and community agree for exactly half
of the nodes (bottom curve) there is no correlation
between the two and the metadata cannot help in
community detection. It thus comes as no surprise that
this curve shows the lowest success rate. For the higher
levels of correlation the metadata contain useful
information and the algorithm’s performance improves
accordingly.

Examining the figure, a clear pattern emerges. For
large $c_\mathrm{in} - c_\mathrm{out}$ the network contains strong
community structure and the algorithm reliably classifies
essentially all nodes into the correct groups, as we
would expect of any effective algorithm. As the structure
weakens the fraction of correct nodes declines, but it
remains higher in all the cases where the metadata are
useful than in the lowest curve where they are not.
Moreover, the algorithm’s success rate appears to improve
monotonically with the level of correlation between
metadata and communities.

When there are no metadata, it is known that the EM
algorithm gives optimal answers to the community
detection problem in the sense that no other algorithm
will classify a higher fraction of nodes correctly on
average [?]. The fact that our algorithm does better when
there are metadata thus implies that the algorithm with
metadata does better than any possible algorithm without
metadata.

FIG. 1: Tests on synthetic benchmark networks with n =
10 000 nodes. (a) Fraction of correctly assigned nodes for
networks with two planted communities with mean degree
c = 8, as a function of the difference between the numbers
of within- and between-group connections. The five curves
show results for networks with a match between metadata and
planted communities on a fraction 0.5, 0.6, 0.7, 0.8, and 0.9 of
nodes (bottom to top). The vertical dashed line indicates the
theoretical detectability threshold, below which no algorithm
without metadata can detect the communities. (b) Fraction
of 100 four-group test networks where the algorithm selects a
particular 2-way division, out of several competing
possibilities, with and without the help of metadata that are
weakly correlated with the desired division. A run is
considered to find the correct division if the fraction of
correctly classified nodes exceeds 85%. Network parameters
are $c_\mathrm{out} = 4$ and $c_\mathrm{in} = 20$.


Furthermore, it has previously been shown that below
the so-called detectability threshold, which occurs at
$c_\mathrm{in} - c_\mathrm{out} = \sqrt{2(c_\mathrm{in} + c_\mathrm{out})}$
(indicated by the vertical dashed line in the figure, and
aligning with the sharp transition in the bottom curve),
community structure becomes so weak as to be
undetectable by any algorithm that relies on network
structure alone [?, ?]. Well below this threshold,
however, our algorithm still correctly classifies a
fraction of the nodes roughly equal to the fraction of
metadata that match the communities, meaning that the
algorithm does better with metadata than without them
even below the threshold. One can understand this result
theoretically by observing that $\gamma_{sx} = \delta_{sx}$ is
a fixed point of Eqs. (3) to (5), so that assigning each
node u to the group indicated by its metadata $x_u$ is
always a solution. Figure 1a also shows that the fraction
of correctly classified nodes beats this baseline level
for values of $c_\mathrm{in} - c_\mathrm{out}$ somewhat below the
threshold, suggesting that the use of the metadata shifts
the threshold downward or perhaps eliminates it
altogether.
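As a quick numerical check on where the threshold falls, the sketch below evaluates the condition for two equal-sized groups. It assumes the mean degree is c = (c_in + c_out)/2, which holds when each node has about n/2 potential neighbors inside its own group and n/2 outside; the function names are hypothetical.

```python
import math


def detectability_threshold(c):
    """Value of c_in - c_out at the threshold, for mean degree c.

    Uses c_in + c_out = 2c for two equal-sized groups.
    """
    return math.sqrt(2.0 * (2.0 * c))


def is_detectable(c_in, c_out):
    """True if structure alone is above the detectability threshold."""
    return (c_in - c_out) > math.sqrt(2.0 * (c_in + c_out))


threshold = detectability_threshold(8.0)   # mean degree used in Fig. 1a
print(round(threshold, 3))                 # 5.657
```

For the mean degree c = 8 of the benchmark networks, the threshold therefore sits at a difference of about 5.66 between the within-group and between-group parameters.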

In short, our method automatically combines the
available information from network structure and metadata
to do a better job of community detection than any
algorithm based on network structure alone. And when
either the network or the metadata contain no
information about community structure the algorithm
correctly ignores them and returns an estimate based
only on the other.

Figure 1b shows a different synthetic test, of the
algorithm’s ability to select between competing divisions
of a network. In this test, networks were generated with
four equally sized communities but the algorithm was
tasked with finding a division into just two communities.
There are eight ways of dividing such a network in two if
we are to keep the four underlying groups undivided. We
imagine a situation in which we are interested in finding
a particular one out of these eight. A conventional
community detection algorithm may find a reasonable
division of the network, but there is no guarantee it will
find the “correct” one; some fraction of the time we can
expect it to find one of the competing divisions. But if
the algorithm is given a set of metadata that correlate
with the division of interest, even if the correlation is
poor, the likelihood of that division will be increased
relative to the others and it will become favored.

In our tests the desired division was one that places
two of the underlying four groups in one community and
the remaining two in the other. Two-valued metadata
were generated that agree with this division 65% of the
time, a relatively weak level of correlation, not far above
the 50% of completely uncorrelated data. Nonetheless,
as shown in Figure 1b, this is enough for the algorithm to
reliably find the correct division of the network in almost
every case (98% of the time in our tests). Without the
metadata, by contrast, we succeed only 6% of the time.
Some practical applications of this ability to select among
competing divisions are given in the next section.
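The success criterion used in these tests, the fraction of correctly classified nodes, has to allow for the fact that community labels are arbitrary. One way to compute it, sketched here as an assumption about how such a score might be evaluated, is to take the best agreement over all relabelings of the inferred groups. The function name `fraction_correct` is hypothetical, and the exhaustive search over permutations is practical only for small numbers of groups.

```python
from itertools import permutations

import numpy as np


def fraction_correct(found, truth, k):
    """Best fraction of matching labels over all relabelings of `found`."""
    found, truth = np.asarray(found), np.asarray(truth)
    best = 0.0
    for perm in permutations(range(k)):
        relabeled = np.array(perm)[found]      # apply the relabeling
        best = max(best, float(np.mean(relabeled == truth)))
    return best


found = [1, 1, 0, 0, 0]
truth = [0, 0, 1, 1, 0]
score = fraction_correct(found, truth, k=2)    # 0.8 after swapping labels
run_succeeds = score > 0.85                    # criterion quoted in Fig. 1
```

In this toy case the best relabeling matches four of five nodes, so the run would not count as a success under the 85% criterion.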


B. Real-world networks 

In this section we describe applications to three
real-world networks, drawn from social, biological, and
technological domains respectively. Two further
applications are given in Appendix C.

School friendships: For our first application we analyze
a network of school students, drawn from the US
National Longitudinal Study of Adolescent Health [?]. The
network represents patterns of friendship, established by
survey, among the 795 students in a medium-sized
American high school (US grades 9 to 12, ages 14 to 18
years) and its feeder middle school (grades 7 and 8, ages
12 to 14 years).

Given that this network combines middle and high
schools, it comes as no surprise that there is a clear
division (previously documented) into two network
communities corresponding roughly to the two schools [?].
Previous work, however, has also shown the presence of
divisions by ethnicity. Our method allows us to select
between divisions by using metadata that correlate with
the one in which we are interested.

Figure 2 shows the results of applying our algorithm
to the network three times. Each time, we asked the
algorithm to divide the network into two communities.
In Fig. 2a, we used the six school grades as metadata and
the algorithm readily identifies a division into grades 7
and 8 on the one hand and grades 9 to 12 on the other,
i.e., the division into middle school and high school. In
Fig. 2b, by contrast, we used the students’ self-identified
ethnicity as metadata, which in this data set takes one of
four values: white, black, hispanic, or other (plus a small
number of nodes with missing data). Now the algorithm
finds a completely different division into two groups, one
group consisting principally of black students and one
of white. (The small number of remaining students are
distributed roughly evenly between the groups.)


FIG. 2: Three divisions of a school friendship network, using
as metadata (a) school grade, (b) ethnicity, and (c) gender.

One might be concerned that in these examples the
algorithm is mainly following the metadata to determine
community memberships, and ignoring the network
structure. To test for this possibility, we performed a
third analysis, using gender as metadata. When we do
this, as shown in Fig. 2c, the algorithm does not find a
division into male and female groups. Instead, it finds
a new division that is a hybrid of the grade and
ethnicity divisions (white high-school students in one
group and everyone else in the other). That is, the
algorithm has ignored the gender metadata, because there
was no good network division that correlated with it, and
instead found a division based on the network structure
alone. The algorithm makes use of the metadata only
when doing so improves the quality of the network
division, in the sense of increasing the value of the
likelihood.


FIG. 3: Three-way decomposition of the marine food web
described in the text, with the logarithm of mean body mass
used as metadata. Node sizes are proportional to log-mass
and colors indicate species role within the ecosystem.

The extent to which the communities found by our
algorithm match the metadata (or any other “ground
truth” variable) can be quantified by calculating a
normalized mutual information (NMI) [?]. NMI ranges
in value from 0 when the metadata are uninformative
about the communities to 1 when the metadata specify
the communities completely. (See the Supplemental
Information for a detailed definition and discussion of
normalized mutual information.) The divisions shown
in Figs. 2a and 2b have NMI scores of 0.881 and 0.820
respectively, indicating that the metadata are strongly
though not perfectly correlated with community
membership. By contrast, the division in Fig. 2c, where
gender was used as metadata, has an NMI score of 0.003,
indicating that the metadata contain essentially zero
information about the communities.
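For concreteness, the sketch below computes one common form of normalized mutual information between two labelings of the same nodes. The symmetric normalization used here, twice the mutual information divided by the sum of the two entropies, is an assumption made for illustration; the definition given in the Supplemental Information may normalize differently. The function names are hypothetical.

```python
import numpy as np


def entropy_from_counts(counts):
    """Shannon entropy (in nats) of the distribution given by the counts."""
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def nmi(a, b):
    """Symmetric NMI, 2 I(a; b) / (H(a) + H(b)), for integer label arrays."""
    a, b = np.asarray(a), np.asarray(b)
    h_a = entropy_from_counts(np.unique(a, return_counts=True)[1])
    h_b = entropy_from_counts(np.unique(b, return_counts=True)[1])
    pairs = np.stack([a, b], axis=1)           # joint labels, one row per node
    h_ab = entropy_from_counts(
        np.unique(pairs, axis=0, return_counts=True)[1])
    mutual = h_a + h_b - h_ab
    if h_a + h_b == 0.0:
        return 1.0      # both labelings are constant, so they agree trivially
    return 2.0 * mutual / (h_a + h_b)


same_up_to_relabeling = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # close to 1
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])             # close to 0
```

Because the score depends only on the joint distribution of the two labelings, it is unchanged when the community labels are permuted.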

Predator-prey interactions: Our next application is one
with ordered metadata of the kind described in Section II.
The network in this case is an ecological one, a food web
of predator-prey interactions between 488 marine species
living in the Weddell Sea, a large bay off the coast of
Antarctica. A number of different metadata are
available for these species, including feeding mode
(deposit feeder, suspension feeder, scavenger, etc.), zone
within the ocean (benthic, pelagic, etc.), and others. In
our analysis, however, we focus on one in particular, the
average adult body mass. Body masses of species in
this ecosystem have a wide range, from microorganisms
weighing nanograms or less to hundreds of tonnes for the
largest whales. Conventionally in such cases one often
works with the logarithm of mass, which makes the range
more manageable, and we do so here. Then we perform
k-way community decompositions using this log-mass as
metadata, for various values of k.

Figure 3 shows the results for k = 3. Nodes are colored
according to their role in the ecosystem (carnivores,
herbivores, primary producers, and so forth). The division
found by the algorithm appears to match these roles quite
closely, with one group composed almost entirely of
primary producers and herbivores, one of omnivores, and a
third that contains most of the carnivores. Node sizes in
the figure are proportional to log-mass, which increases
as we go up the figure, indicating that the algorithm
has recovered from the network structure the well-known
correlation between body mass and ecosystem role [?].
This point is further emphasized by the values of the
prior probabilities of membership in the three groups as
a function of body mass (see Fig. S1 in the Supplemental
Information), which show that low-mass organisms are
overwhelmingly likely to be in the first group, and
high-mass ones in the third group. Organisms of
intermediate mass have a broader distribution, but are
particularly concentrated in the second group.

The prior probabilities are also of interest in their own 
right. If, for instance, we were to learn of a new species, 
previously unrepresented in our food-web data set, then 
even without knowing its pattern of network connections 
we can make a statement about its probability of
belonging to each of the communities, as well as its probability
of interaction with other species, so long as we know its 
body mass. For instance, a low body mass of 10“^^ g 
would put a species with high probability in group 1 in 
Fig. 3, meaning it is almost certainly a primary producer
or a herbivore, with the interaction patterns that implies. 

Internet graph: Community detection is widely studied
precisely because it is believed that network
communities are correlated with network function. More
specifically, it is commonly assumed that communities
correlate with some underlying functional variable, which
may or may not be observed. This assumption, however, has
been challenged by recent work that compared communities
in real-world networks against “ground truth” metadata
variables and found little correlation between the
two [?, ?]. This is a striking discovery, but there is a
caveat. As we have seen, there are often multiple
meaningful community divisions of a network (as in the
school friendship network of Fig. 2, for example), and the
fact that one division is uncorrelated with a given
metadata variable does not rule out the possibility that
another could be.

Our third real-world example application illustrates
these issues using one of the same networks studied in
Ref. [?], a 46 676-node representation of the peering
structure of the Internet at the level of autonomous
systems. The “ground truth” variable for this network is
the country in which each autonomous system is located.
The analysis of Ref. [?] found there to be little
correlation between community structure and countries.

We first analyze the network without metadata,
performing a traditional “blind” community division into
five groups using the standard EM algorithm. We then
repeat the analysis using the algorithm of this paper with
the countries as metadata. Recall that, in doing this, we
do not force the algorithm to find a community division
that aligns with the metadata if no such division exists,
but if a division does exist it will be favored over
competing divisions that do not align with the metadata.
There are 173 distinct countries in the data set, a
significantly larger number of metadata values than for
any of the other networks we have considered, but by no
means beyond the capabilities of our method.

As before, we assess the results using the normalized mutual information. If
indeed there are many competing divisions of the network, only some of which
correlate with the particular metadata we are given, then we would expect our
blind analysis to return a range of NMI values on different runs, some low and
(maybe) some higher. This is indeed what we see, with the NMI in our
calculations ranging from a high of 0.626 to a relatively low 0.398, the latter
being in agreement with results quoted by Hric et al. Conversely, when the
algorithm of this paper is applied with countries as metadata, we find an NMI
score significantly higher than any of these figures, at 0.870, which would
conventionally be interpreted as an indication of strong correlation.

These results emphasize that an apparent lack of correlation between network
communities and metadata could be the result of the presence of competing
network divisions, some of which are correlated with the particular metadata we
have in hand while others are not. The algorithm of this paper allows us to
select among divisions and hence find ones that correlate with the variable of
interest.


IV. CONCLUSIONS 

In this paper we have described a technique for directly incorporating
annotations or “metadata” into the analysis of networks. We have focused on the
problem of community detection, although the methods we describe could in
principle be applied to other analyses. We have shown that the incorporation of
metadata improves the accuracy of community detection in controlled studies on
benchmark networks and also allows us to select among competing community
divisions in the same network, a common feature of practical network data sets.
Our method is able to infer the level of correlation between metadata and
network structure, and will automatically use or ignore the metadata as
appropriate based on this inference. We have demonstrated applications of the
method to a variety of data sets, including social, biological, and
technological networks, finding improved results and flexibility of analysis
compared to methods that do not use metadata.

There are a number of possible extensions of this work. At the simplest level
one could include more complex metadata types, such as combinations of discrete
and continuous variables, or vector variables such as spatial coordinates.
Metadata could also be incorporated into methods for detecting other structure
types, such as hierarchy, core-periphery structure, rankings, or latent-space
structure. And the resulting fits could form the starting point for a variety
of additional applications, such as the prediction of missing links or missing
metadata in incomplete data sets. These and other possibilities we leave for
future work.

Appendix A: EM algorithm

In this appendix we present the derivation of the 
expectation-maximization (EM) algorithm used to fit 
our model to empirical network data. 

1. Unordered data 

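Given a network, represented by its adjacency matrix $\mathbf{A}$, plus the
accompanying vector of metadata $\mathbf{x}$, our goal is to determine the
values of the parameter matrices $\Theta$, $\Gamma$ that maximize the
likelihood of the network

\[
P(\mathbf{A}|\Theta,\Gamma,\mathbf{x}) = \sum_{\mathbf{s}} P(\mathbf{A}|\Theta,\mathbf{s})\, P(\mathbf{s}|\Gamma,\mathbf{x}), \tag{A1}
\]

where

\[
P(\mathbf{A}|\Theta,\mathbf{s}) = \prod_{u<v} p_{uv}^{a_{uv}} (1 - p_{uv})^{1-a_{uv}} \tag{A2}
\]

and

\[
P(\mathbf{s}|\Gamma,\mathbf{x}) = \prod_u \gamma_{s_u,x_u}, \tag{A3}
\]

with

\[
p_{uv} = d_u d_v \theta_{s_u,s_v} \tag{A4}
\]

and $d_u$ being the degree of node $u$. Typically, rather than maximizing (A1)
itself, we maximize instead its logarithm,

\[
\log P(\mathbf{A}|\Theta,\Gamma,\mathbf{x}) = \log \sum_{\mathbf{s}} P(\mathbf{A}|\Theta,\mathbf{s})\, P(\mathbf{s}|\Gamma,\mathbf{x}), \tag{A5}
\]

which gives the same answer for $\Theta$ and $\Gamma$ but is often more
convenient.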

The most obvious approach for performing the maximization would be simply to
differentiate with respect to the parameters, set the result to zero, and solve
the resulting equations. This, however, produces a complex set of implicit
equations that have no easy solution. Instead, therefore, we make use of
Jensen’s inequality, which says that for any set of positive quantities $x_i$
the log of their sum obeys

\[
\log \sum_i x_i \ge \sum_i q_i \log \frac{x_i}{q_i}, \tag{A6}
\]

where $q_i$ is any correctly normalized probability distribution such that
$\sum_i q_i = 1$. Note that the exact equality is recovered by the particular
choice

\[
q_i = \frac{x_i}{\sum_j x_j}. \tag{A7}
\]
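As a quick numerical illustration of (A6) and (A7), the following Python sketch evaluates the right-hand side of the inequality for an arbitrary normalized distribution and for the optimal choice. It is not part of the original calculations: the function name `jensen_lower_bound` and the example values of $x_i$ are assumptions made purely for illustration.

```python
import numpy as np

def jensen_lower_bound(x, q):
    """Right-hand side of (A6): sum_i q_i * log(x_i / q_i)."""
    x = np.asarray(x, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * np.log(x / q)))

# Hypothetical positive quantities x_i, chosen only for illustration.
x = np.array([0.2, 1.5, 3.0])
log_sum = float(np.log(x.sum()))             # left-hand side of (A6)

q_uniform = np.full(x.size, 1.0 / x.size)    # an arbitrary normalized q
q_optimal = x / x.sum()                      # the particular choice (A7)

bound_uniform = jensen_lower_bound(x, q_uniform)   # a strict lower bound here
bound_optimal = jensen_lower_bound(x, q_optimal)   # matches log_sum
```

With the choice (A7) every ratio $x_i/q_i$ equals $\sum_j x_j$, which is why the bound becomes an equality.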


Applying Jensen’s inequality to Eq. (A5), we find that

\[
\log P(\mathbf{A}|\Theta,\Gamma,\mathbf{x}) \ge \sum_{\mathbf{s}} q(\mathbf{s}) \log \frac{P(\mathbf{A}|\Theta,\mathbf{s})\, P(\mathbf{s}|\Gamma,\mathbf{x})}{q(\mathbf{s})}
= \sum_{\mathbf{s}} q(\mathbf{s}) \log P(\mathbf{A}|\Theta,\mathbf{s}) + \sum_{\mathbf{s}} q(\mathbf{s}) \log P(\mathbf{s}|\Gamma,\mathbf{x}) - \sum_{\mathbf{s}} q(\mathbf{s}) \log q(\mathbf{s}), \tag{A8}
\]

where $q(\mathbf{s})$ is any distribution over community assignments
$\mathbf{s}$ such that $\sum_{\mathbf{s}} q(\mathbf{s}) = 1$. The maximum of the
right-hand side of this inequality with respect to possible choices of the
distribution $q(\mathbf{s})$ coincides with the exact equality, which,
following Eq. (A7), is when

\[
q(\mathbf{s}) = \frac{P(\mathbf{A}|\Theta,\mathbf{s})\, P(\mathbf{s}|\Gamma,\mathbf{x})}{\sum_{\mathbf{s}} P(\mathbf{A}|\Theta,\mathbf{s})\, P(\mathbf{s}|\Gamma,\mathbf{x})}. \tag{A9}
\]

Thus the maximization of the left-hand side of (A8) with respect to $\Theta$,
$\Gamma$ to give the optimal values of the parameters is equivalent to a
maximization of the right-hand side both with respect to $q(\mathbf{s})$ (which
makes it equal to the left-hand side) and with respect to $\Theta$, $\Gamma$. A
simple algorithm for performing such a double maximization is to repeatedly
maximize with respect to first $q(\mathbf{s})$ and then $\Theta$, $\Gamma$
until we converge to an answer. In other words:

1. Make an initial guess about the parameter values and use them to calculate
   the optimal $q(\mathbf{s})$ from Eq. (A9).

2. Using that value, maximize the right-hand side of (A8) with respect to the
   parameters, while holding $q(\mathbf{s})$ constant.

3. Repeat from step 1 until convergence is achieved.

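Step 2 can be performed by differentiating with $q(\mathbf{s})$ fixed and
subject to the normalization constraint $\sum_s \gamma_{sx} = 1$ for all $x$.
Performing the derivatives and assuming that the network is large and sparse so
that $p_{uv}$ is small, we find to leading order in small quantities that

\[
\theta_{st} = \frac{\sum_{uv} a_{uv}\, q^{uv}_{st}}{\sum_{uv} d_u d_v\, q^{uv}_{st}}, \qquad
\gamma_{sx} = \frac{\sum_u \delta_{x,x_u}\, q^u_s}{\sum_u \delta_{x,x_u}}, \tag{A10}
\]

where

\[
q^u_s = \sum_{\mathbf{s}} q(\mathbf{s})\, \delta_{s_u,s}, \qquad
q^{uv}_{st} = \sum_{\mathbf{s}} q(\mathbf{s})\, \delta_{s_u,s}\, \delta_{s_v,t}. \tag{A11}
\]

In addition, for a large sparse network, the community assignments of distant
nodes will be uncorrelated and hence we can write
$q^{uv}_{st} \simeq q^u_s q^v_t$ in the denominator of (A10) to get

\[
\theta_{st} = \frac{\sum_{uv} a_{uv}\, q^{uv}_{st}}{\sum_u d_u q^u_s \sum_v d_v q^v_t}, \tag{A12}
\]

which reduces the denominator sums from $n^2$ terms to only $n$ and
considerably speeds the calculation. (We cannot make the same factorization in
the numerator, since the terms in the numerator involve $q^{uv}_{st}$ on
adjacent nodes $u$, $v$ only, so the nodes are not distant from one another.)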
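The saving described here can be made concrete with a short Python sketch comparing the full denominator of (A10) with the factorized denominator of (A12). The function names and the random test data are assumptions for illustration only. In this toy case the two-node marginals are constructed to factorize exactly for every pair, so the two expressions agree to rounding error; on real data the factorization is an approximation that relies on most node pairs being distant.

```python
import numpy as np

def theta_denominator_full(d, q2):
    """Denominator of (A10): sum_uv d_u d_v q^{uv}_{st}, with q2 of shape (n, n, k, k)."""
    return np.einsum('u,v,uvst->st', d, d, q2)

def theta_denominator_factorized(d, q1):
    """Denominator of (A12): (sum_u d_u q^u_s)(sum_v d_v q^v_t), with q1 of shape (n, k)."""
    w = d @ q1                    # w[s] = sum_u d_u q^u_s, only O(n k) work
    return np.outer(w, w)

# Hypothetical data: random degrees and one-node marginals.
rng = np.random.default_rng(0)
n, k = 6, 3
d = rng.integers(1, 5, size=n).astype(float)
q1 = rng.random((n, k))
q1 /= q1.sum(axis=1, keepdims=True)

# Two-node marginals that factorize exactly, q^{uv}_{st} = q^u_s q^v_t.
q2 = np.einsum('us,vt->uvst', q1, q1)

full = theta_denominator_full(d, q2)
fast = theta_denominator_factorized(d, q1)
```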

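Equation (A9) tells us that once the iteration converges, the value of
$q(\mathbf{s})$ is

\[
q(\mathbf{s}) = \frac{P(\mathbf{A}|\Theta,\mathbf{s})\, P(\mathbf{s}|\Gamma,\mathbf{x})}{\sum_{\mathbf{s}} P(\mathbf{A}|\Theta,\mathbf{s})\, P(\mathbf{s}|\Gamma,\mathbf{x})}
= \frac{P(\mathbf{A},\mathbf{s}|\Theta,\Gamma,\mathbf{x})}{P(\mathbf{A}|\Theta,\Gamma,\mathbf{x})}
= P(\mathbf{s}|\mathbf{A},\Theta,\Gamma,\mathbf{x}). \tag{A13}
\]

In other words $q(\mathbf{s})$ is the posterior distribution over community
assignments $\mathbf{s}$, the probability of an assignment $\mathbf{s}$ given
the inputs $\mathbf{A}$, $\Theta$, $\Gamma$, and $\mathbf{x}$.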
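To make the alternation between (A9) and (A10) concrete, here is a minimal Python sketch for very small networks. It is not the implementation used for the results in this paper: it computes $q(\mathbf{s})$ by enumerating all $k^n$ assignments rather than by belief propagation, so it is only feasible for a handful of nodes. The function name `em_exact`, the random initialization, the numerical floors, and the toy network are all assumptions made for illustration.

```python
import itertools
import numpy as np

def em_exact(A, x, k, n_iter=30, seed=0):
    """EM for the model of (A1)-(A4), with q(s) found by brute-force enumeration.

    A: (n, n) symmetric 0/1 adjacency matrix with zero diagonal.
    x: (n,) integer metadata labels 0..K-1, each assumed to occur at least once.
    Returns theta (k, k), gamma (k, K) and the one-node marginals q1 (n, k).
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d = A.sum(axis=1)                                   # degrees d_u
    K = int(x.max()) + 1
    # Random initial guess for the parameters (an arbitrary choice for this sketch).
    theta = (1.0 + rng.random((k, k))) / d.sum()
    theta = (theta + theta.T) / 2.0
    gamma = 1.0 + rng.random((k, K))
    gamma /= gamma.sum(axis=0)
    S = np.array(list(itertools.product(range(k), repeat=n)))   # all assignments
    onehot = np.eye(k)[S]                               # shape (k**n, n, k)
    iu, iv = np.triu_indices(n, 1)                      # node pairs u < v
    offdiag = 1.0 - np.eye(n)
    for _ in range(n_iter):
        # E-step, Eq. (A9): q(s) proportional to P(A|theta,s) P(s|gamma,x).
        p = d[iu] * d[iv] * theta[S[:, iu], S[:, iv]]
        p = np.clip(p, 1e-12, 1.0 - 1e-12)              # keep the logs finite
        a = A[iu, iv]
        log_w = (a * np.log(p) + (1.0 - a) * np.log1p(-p)).sum(axis=1)
        log_w += np.log(np.maximum(gamma[S, x], 1e-300)).sum(axis=1)
        q = np.exp(log_w - log_w.max())
        q /= q.sum()
        # Marginals, Eq. (A11).
        q1 = np.einsum('a,aus->us', q, onehot)
        q2 = np.einsum('a,aus,avt->uvst', q, onehot, onehot)
        # M-step, Eq. (A10); pairs with u = v are left out of the denominator.
        num = np.einsum('uv,uvst->st', A, q2)
        den = np.einsum('u,v,uv,uvst->st', d, d, offdiag, q2)
        theta = num / np.maximum(den, 1e-300)
        for value in range(K):
            gamma[:, value] = q1[x == value].mean(axis=0)
    return theta, gamma, q1

# Toy network: two triangles joined by one edge, with metadata matching the triangles.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
x = np.array([0, 0, 0, 1, 1, 1])
theta, gamma, q1 = em_exact(A, x, k=2)
```

The returned `q1` holds the one-node marginals $q^u_s$, from which a community assignment can be read off by taking the most probable group for each node.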


2. Final likelihood value 

The EM algorithm always converges to a maximum of 
the likelihood but is not guaranteed to converge to the 
global maximum—it is possible for there to be one or 
more local maxima as well. To get around this problem 
we normally run the algorithm repeatedly with different 
random initial guesses for the parameters and from the 
results choose the one that finds the highest likelihood 
value. In the calculations presented in this paper we did 
at least ten such “random restarts” for each network. 

To determine which run has the highest final value of the likelihood we
calculate the likelihood from the right-hand side of (A8) using
$P(\mathbf{A}|\Theta,\mathbf{s})$ and $P(\mathbf{s}|\Gamma,\mathbf{x})$ as in
Eqs. (A2) and (A3), the final fitted values of the parameters $\Theta$ and
$\Gamma$ from the EM algorithm, and $q(\mathbf{s})$ as in Eq. (A9). (As we have
said, the right-hand side of (A8) becomes equal to the left, and hence equal to
the true log-likelihood, when $q(\mathbf{s})$ is given the value in Eq. (A9).)
Putting it all together, our expression for the log-likelihood is

\[
\log P(\mathbf{A}|\Theta,\Gamma,\mathbf{x}) = \sum_{\mathbf{s}} q(\mathbf{s}) \sum_{u<v} \bigl[ a_{uv} \log(d_u d_v \theta_{s_u,s_v}) + (1 - a_{uv}) \log(1 - d_u d_v \theta_{s_u,s_v}) \bigr]
+ \sum_{\mathbf{s}} q(\mathbf{s}) \sum_u \log \gamma_{s_u,x_u} - \sum_{\mathbf{s}} q(\mathbf{s}) \log q(\mathbf{s}). \tag{A14}
\]

Neglecting terms beyond first order in small quantities, the first sum can be
rewritten as

\[
\tfrac12 \sum_{uv} \sum_{st} \bigl[ q^{uv}_{st}\, a_{uv} (\log d_u + \log d_v + \log \theta_{st}) - q^{uv}_{st}\, d_u d_v \theta_{st} \bigr]
= \tfrac12 \sum_u d_u \log d_u + \tfrac12 \sum_v d_v \log d_v + \tfrac12 \sum_{st} \log \theta_{st} \sum_{uv} a_{uv}\, q^{uv}_{st} - \tfrac12 \sum_{st} \theta_{st} \sum_{uv} d_u d_v\, q^{uv}_{st}, \tag{A15}
\]

where we have made use of $\sum_{st} q^{uv}_{st} = 1$ and
$\sum_v a_{uv} = d_u$.


The first two terms in (A15) are constant for any given network and hence can
be neglected, since they are irrelevant for comparing the likelihood values
between different runs on the same network. The final term can be rewritten
using Eq. (A10) as

\[
\sum_{st} \theta_{st} \sum_{uv} d_u d_v\, q^{uv}_{st} = \sum_{st} \sum_{uv} a_{uv}\, q^{uv}_{st} = \sum_{uv} a_{uv}, \tag{A16}
\]

which is also a constant and can be neglected. Thus only the third term in
(A15) need be carried over.

The second sum in (A14) is

\[
\sum_{\mathbf{s}} q(\mathbf{s}) \sum_u \log \gamma_{s_u,x_u} = \sum_{su} q^u_s \log \gamma_{s,x_u}
= \sum_{su} \sum_x \delta_{x,x_u}\, q^u_s \log \gamma_{sx} = \sum_{usx} \delta_{x,x_u}\, \gamma_{sx} \log \gamma_{sx}, \tag{A17}
\]

where we have used Eq. (A10) again in the second line.

The final sum in (A14) is the entropy of the posterior distribution
$q(\mathbf{s})$, which is harder to calculate because it requires not just the
marginals of $q$ but the entire distribution. We get around this by making the
so-called Bethe approximation

\[
q(\mathbf{s}) = \frac{\prod_{u<v} \bigl[ q^{uv}_{s_u,s_v} \bigr]^{a_{uv}}}{\prod_u \bigl[ q^u_{s_u} \bigr]^{d_u - 1}}, \tag{A18}
\]

which is exact on trees and locally tree-like networks, and is considered to be
a good working approximation on other networks. Substituting this form into the
entropy term gives

\[
\sum_{\mathbf{s}} q(\mathbf{s}) \log q(\mathbf{s}) = \tfrac12 \sum_{uv} \sum_{st} a_{uv}\, q^{uv}_{st} \log q^{uv}_{st} - \sum_u (d_u - 1) \sum_s q^u_s \log q^u_s. \tag{A19}
\]
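The statement that the Bethe form is exact on trees can be checked numerically on a tiny example. The Python sketch below builds a tree-structured distribution over three nodes on a path, computes its one- and two-node marginals by enumeration, and compares the exact value of $\sum_{\mathbf{s}} q(\mathbf{s}) \log q(\mathbf{s})$ with the right-hand side of (A19). The function name and all numerical values are assumptions chosen for illustration.

```python
import itertools
import numpy as np

def bethe_sum_q_log_q(A, q1, q2):
    """Right-hand side of (A19), the Bethe estimate of sum_s q(s) log q(s)."""
    d = A.sum(axis=1)
    safe = lambda p: np.where(p > 0, p, 1.0)       # makes 0 * log 0 evaluate to 0
    pair = 0.5 * np.sum(A[:, :, None, None] * q2 * np.log(safe(q2)))
    single = np.sum((d - 1.0)[:, None] * q1 * np.log(safe(q1)))
    return float(pair - single)

# Hypothetical test case: the path 0-1-2 (a tree) with a Markov-chain q(s), k = 2.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
pi = np.array([0.3, 0.7])                      # distribution of s_0
T01 = np.array([[0.9, 0.1], [0.2, 0.8]])       # P(s_1 | s_0)
T12 = np.array([[0.6, 0.4], [0.5, 0.5]])       # P(s_2 | s_1)

S = np.array(list(itertools.product(range(2), repeat=3)))      # all 8 assignments
q = pi[S[:, 0]] * T01[S[:, 0], S[:, 1]] * T12[S[:, 1], S[:, 2]]

onehot = np.eye(2)[S]
q1 = np.einsum('a,aus->us', q, onehot)                   # one-node marginals
q2 = np.einsum('a,aus,avt->uvst', q, onehot, onehot)     # two-node marginals

exact = float(np.sum(q * np.log(q)))
bethe = bethe_sum_q_log_q(A, q1, q2)
```

On networks with loops the two values would generally differ, which is why (A18) is described as an approximation.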

Combining Eqs. (A15) to (A19) and substituting into Eq. (A14), our final
expression for the log-likelihood, neglecting constants, is

\[
\log P(\mathbf{A}|\Theta,\Gamma,\mathbf{x}) = \tfrac12 \sum_{st} \log \theta_{st} \sum_{uv} a_{uv}\, q^{uv}_{st} + \sum_u \sum_s \gamma_{s,x_u} \log \gamma_{s,x_u}
- \tfrac12 \sum_{uv} \sum_{st} a_{uv}\, q^{uv}_{st} \log q^{uv}_{st} + \sum_u (d_u - 1) \sum_s q^u_s \log q^u_s. \tag{A20}
\]

The run that returns the largest value of this quantity is the run with the
highest likelihood and hence the best fit to the model.


3. Ordered metadata 

The case of ordered metadata, such as the body masses used in the food web
example of Fig. 4 in the main paper, is more involved. Let $P(s|x)$ be the
prior probability that a node belongs to community $s$ given metadata $x$. In
most cases the metadata have a finite range and for convenience we normalize
them to fall in the range $x \in [0,1]$. (In the rarer case of metadata with
infinite range a transformation can be applied first to bring them into a
finite range.) One immediate question that arises is what limitations should be
placed on the form of the probability $P(s|x)$. We cannot allow it to take any
functional form, such as ones that vary arbitrarily rapidly, for (at least) two
reasons. First, it would be unphysical: there are good reasons in most cases to
believe that nodes with infinitesimally different metadata $x$ have only
infinitesimally different probabilities of falling in a particular group. In
other words, $P(s|x)$ should be smooth and slowly varying in some sense.
Second, a function that can vary arbitrarily rapidly can have arbitrarily many
degrees of freedom, which would lead to overfitting of the model.

To avoid both of these problems, we enforce a slowly varying prior by writing
the function $P(s|x)$ as an expansion in a finite set of suitably chosen basis
functions. In our work we use polynomials of finite degree. There is an
interesting model selection problem inherent in the choice of degree which we
do not tackle here but which would be a good topic for future research.

For representing probability functions in $[0,1]$, as here, a convenient choice
of polynomial basis is the Bernstein polynomials of degree $N$:

\[
B_j(x) = \binom{N}{j} x^j (1 - x)^{N-j}, \qquad j = 0 \ldots N. \tag{A21}
\]

Bernstein polynomials have three particular properties that make them useful
for representing probabilities:

1. They form a complete basis set for polynomials of degree $N$.

2. They fall in the range $0 \le B_j(x) \le 1$ for all $x \in [0,1]$ and all
   $j$.

3. They satisfy the sum rule

\[
\sum_{j=0}^{N} B_j(x) = 1 \tag{A22}
\]

   for all $x \in [0,1]$.
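Properties 2 and 3 are easy to verify numerically. The following Python sketch evaluates the Bernstein basis of (A21) on a grid of points; the function name `bernstein_basis` and the choice $N = 4$ are assumptions made for illustration.

```python
from math import comb
import numpy as np

def bernstein_basis(x, N):
    """Evaluate B_j(x) of (A21) for j = 0..N; returns an array of shape (len(x), N + 1)."""
    x = np.asarray(x, dtype=float)
    j = np.arange(N + 1)
    binom = np.array([comb(N, int(jj)) for jj in j], dtype=float)
    return binom * x[:, None] ** j * (1.0 - x[:, None]) ** (N - j)

# Evaluate the degree-4 basis on an evenly spaced grid over [0, 1].
grid = np.linspace(0.0, 1.0, 11)
B = bernstein_basis(grid, N=4)

row_sums = B.sum(axis=1)     # sum rule (A22): every entry should equal 1
lowest, highest = B.min(), B.max()
```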


The first of these implies that any degree-$N$ representation of the
probability $P(s|x)$ can be written in the form

\[
P(s|x) = \sum_{j=0}^{N} \gamma_{sj} B_j(x) \tag{A23}
\]

for some choice of coefficients $\gamma_{sj}$. Moreover, if
$\gamma_{sj} \in [0,1]$ for all $s$, $j$ then $P(s|x) \in [0,1]$ for all
$x \in [0,1]$, meaning it is a well-defined probability within this domain. To
see this observe first that $P(s|x) \ge 0$ when $\gamma_{sj} \ge 0$ since all
$B_j(x) \ge 0$, and second that for $\gamma_{sj} \le 1$ we have

\[
P(s|x) = \sum_{j=0}^{N} \gamma_{sj} B_j(x) \le \sum_{j=0}^{N} B_j(x) = 1, \tag{A24}
\]

where we have made use of Eq. (A22).

Finally, the normalization condition that $\sum_s P(s|x) = 1$ for all $x$ can
be satisfied by requiring that

\[
\sum_s \gamma_{sj} = 1 \tag{A25}
\]

so that

\[
\sum_s P(s|x) = \sum_s \sum_{j=0}^{N} \gamma_{sj} B_j(x) = \sum_{j=0}^{N} B_j(x) = 1. \tag{A26}
\]

We now employ the form (A23) to represent the prior probabilities in our EM
algorithm, writing

\[
P(\mathbf{s}|\Gamma,\mathbf{x}) = \prod_u P(s_u|x_u). \tag{A27}
\]

The only change to the algorithm from the previous case arises when we maximize
the right-hand side of Eq. (A8). Instead of maximizing with respect to the
prior probabilities directly, we now maximize with respect to the coefficients
$\gamma_{sj}$ of the expansion. The optimal values of the coefficients are
given by

\[
\hat{\gamma}_{sj} = \operatorname*{argmax}_{\Gamma} \sum_{ut} q^u_t \log \sum_k \gamma_{tk} B_k(x_u), \tag{A28}
\]

subject to the constraint (A25). One can derive conditions for the maximum by
direct differentiation, but the equations do not have a closed-form solution,
so instead we once again employ Jensen’s inequality (A6) to write

\[
\sum_{ut} q^u_t \log \sum_k \gamma_{tk} B_k(x_u) \ge \sum_{utk} q^u_t\, Q^{tu}_k \log \frac{\gamma_{tk} B_k(x_u)}{Q^{tu}_k}, \tag{A29}
\]

which is true for any $Q^{su}_j$ satisfying $\sum_j Q^{su}_j = 1$ for all $u$,
$s$. The exact equality is achieved when

\[
Q^{su}_j = \frac{\gamma_{sj} B_j(x_u)}{\sum_k \gamma_{sk} B_k(x_u)}, \tag{A30}
\]

and the maximum of Eq. (A28) can be computed by first maximizing over
$Q^{su}_j$ in this way and then over $\gamma_{sj}$. This suggests an iterative
algorithm analogous to the EM algorithm in which one computes the $Q^{su}_j$
from (A30) and then, using those values, computes the maximum with respect to
$\gamma_{sj}$ by differentiating the right-hand side of (A29) subject to the
condition (A25), which gives

\[
\gamma_{sj} = \frac{\sum_u q^u_s\, Q^{su}_j}{\sum_{tu} q^u_t\, Q^{tu}_j}. \tag{A31}
\]

Iterating (A30) and (A31) alternately to convergence now gives us the
coefficients $\gamma_{sj}$ of the optimal degree-$N$ polynomial prior. Note
that (A31) always gives $\gamma_{sj}$ in the range from zero to one, so that,
as discussed above, the resulting prior $P(s|x)$ also lies between zero and one
and is thus a lawful probability.
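The alternation between (A30) and (A31) can be sketched in a few lines of Python. The code below is an illustration under stated assumptions rather than the implementation used for this paper: the function names, the uniform starting point, the fixed number of iterations, and the synthetic marginals $q^u_s$ are all hypothetical, and the sketch assumes the metadata values are spread over $[0,1]$ so that no denominator in (A31) vanishes.

```python
from math import comb
import numpy as np

def bernstein_basis(x, N):
    """B_j(x) of (A21) for j = 0..N; returns shape (len(x), N + 1)."""
    x = np.asarray(x, dtype=float)
    j = np.arange(N + 1)
    binom = np.array([comb(N, int(jj)) for jj in j], dtype=float)
    return binom * x[:, None] ** j * (1.0 - x[:, None]) ** (N - j)

def fit_bernstein_prior(q1, x, N, n_iter=200):
    """Iterate (A30) and (A31) for the coefficients gamma_sj.

    q1: (n, k) one-node marginals q^u_s; x: (n,) metadata rescaled to [0, 1].
    Returns gamma with shape (k, N + 1), whose columns sum to 1 as in (A25).
    """
    B = bernstein_basis(x, N)                        # B[u, j] = B_j(x_u)
    k = q1.shape[1]
    gamma = np.full((k, N + 1), 1.0 / k)             # uniform start (an assumption)
    for _ in range(n_iter):
        # (A30): Q[u, s, j] = gamma_sj B_j(x_u) / sum_k gamma_sk B_k(x_u)
        W = gamma[None, :, :] * B[:, None, :]
        Q = W / np.maximum(W.sum(axis=2, keepdims=True), 1e-300)
        # (A31): gamma_sj = sum_u q^u_s Q^{su}_j / sum_{tu} q^u_t Q^{tu}_j
        num = np.einsum('us,usj->sj', q1, Q)
        gamma = num / num.sum(axis=0, keepdims=True)
    return gamma

# Synthetic example: membership in community 0 becomes less likely as x grows.
x = np.linspace(0.05, 0.95, 20)
q1 = np.column_stack([1.0 - x, x])
gamma = fit_bernstein_prior(q1, x, N=3)

# Evaluate the fitted prior P(s|x) of (A23) at the data points.
prior = bernstein_basis(x, 3) @ gamma.T              # prior[u, s] = P(s | x_u)
```

Because the columns of `gamma` sum to one, the rows of `prior` sum to one as well, as in (A26).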

4. Implementation

The calculations for this paper were implemented in the C programming language
for speed. There are a number of additional techniques that can be used to
improve speed and convergence. We find that the majority of the running time of
the algorithm is taken up by the belief propagation calculations, and this time
can be shortened by noting that highly converged values of the beliefs are
pointless in early steps of the EM algorithm. The parameter values used to
calculate the beliefs in these steps are, presumably, highly inaccurate since
the EM algorithm has not converged yet, so there is little point spending a
large amount of time waiting for the beliefs to converge to many decimal places
when there are much bigger sources of error in the calculation. In the
calculations of this paper, we limited the belief propagation to no more than
20 steps at any point. In the early stages of the EM algorithm this gives
rather crude values for the beliefs, but these values would not be particularly
good under any circumstances, no matter how many steps we used, because of the
poor parameter values. In the later stages of the EM algorithm, 20 steps are
enough to ensure good convergence (and indeed we often get good convergence
after many fewer steps than this).

We also place a limit on the total number of iterations of the EM algorithm,
discarding results that fail to converge within the allotted time. In the
calculations in this paper, this second limit was set at either 20 or 100
steps. We have performed some runs with higher limits (up to 1000 EM steps)
but, paradoxically, we find this often gives poorer results, for instance in
our tests on synthetic networks. This seems to be because the EM algorithm
sometimes converges (as we have said) to the wrong solution and empirically
when it does so it also often converges more slowly. By discarding runs that
converge slowly, therefore, we tend to discard incorrect solutions and improve
the average quality of our results.
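The combination of random restarts, an iteration cap, and discarding of slow runs can be sketched as a generic driver in Python. This is an illustration only, not the C code used for this paper: the function `best_of_restarts`, its arguments, and the toy one-parameter objective (which merely stands in for the EM update and the log-likelihood) are all assumptions.

```python
import numpy as np

def best_of_restarts(init, step, loglik, n_restarts=10, max_iter=100, tol=1e-6, seed=0):
    """Run an iterative fit from several random starts and keep the best converged run.

    init(rng) returns starting parameters, step(params) performs one update,
    and loglik(params) scores a converged result. Runs that have not converged
    after max_iter updates are discarded.
    """
    rng = np.random.default_rng(seed)
    best, best_ll = None, -np.inf
    for _ in range(n_restarts):
        params = init(rng)
        converged = False
        for _ in range(max_iter):
            new = step(params)
            converged = np.max(np.abs(new - params)) < tol
            params = new
            if converged:
                break
        if not converged:
            continue                        # discard slowly converging runs
        ll = loglik(params)
        if ll > best_ll:
            best, best_ll = params, ll
    return best, best_ll

# Toy stand-in for the likelihood: a function with two local maxima near t = -1 and t = +1.
def f(t):
    return float(-(t[0] ** 2 - 1.0) ** 2 + 0.3 * t[0])

def grad(t):
    return -4.0 * t * (t ** 2 - 1.0) + 0.3

best, best_ll = best_of_restarts(
    init=lambda rng: rng.uniform(-2.0, 2.0, size=1),
    step=lambda t: t + 0.05 * grad(t),      # plain gradient ascent as the update
    loglik=f,
)
```

In an actual fit, `step` would be one full EM iteration (with its own cap on belief propagation steps) and `loglik` would be the expression (A20).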

Appendix B: Normalized mutual information 

In this appendix we discuss the definition of the normalized mutual information
that we use to measure the quality of the results given by our algorithm.

The most widely used measure of agreement between community divisions and
“ground truth” variables is the normalized mutual information (NMI), first
employed in this context by Danon et al. [26]. Given a community division
represented by the $n$-element vector $\mathbf{s}$ and discrete metadata
represented by $\mathbf{x}$, the conditional entropy of the community division
is

\[
H(\mathbf{s}|\mathbf{x}) = -\sum_x P(x) \sum_s P(s|x) \log P(s|x), \tag{B1}
\]

where $P(x)$ is the fraction of nodes with metadata $x$ and $P(s|x)$ is the
probability that a node belongs to community $s$ if it has metadata $x$.
Traditionally the logarithm is taken in base 2, in which case the units of
conditional entropy are bits.

In our case we already know the value of $P(s|x)$: it is equal to the prior
probability $\gamma_{sx}$ of belonging to community $s$, one of the outputs of
our algorithm. Hence

\[
H(\mathbf{s}|\mathbf{x}) = -\sum_x P(x) \sum_s \gamma_{sx} \log \gamma_{sx}
= -\sum_x \frac{n(x)}{n} \sum_s \gamma_{sx} \log \gamma_{sx}
= -\frac{1}{n} \sum_{su} \gamma_{s,x_u} \log \gamma_{s,x_u}, \tag{B2}
\]

where $n(x) = nP(x)$ is the number of nodes with metadata $x$ and $n$ is the
total number of nodes in the network, as previously.
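The last form in (B2) translates directly into code. The Python sketch below computes the conditional entropy in bits from a matrix of priors and a vector of metadata labels; the function name and the two small test cases are assumptions made for illustration.

```python
import numpy as np

def conditional_entropy_bits(gamma, x):
    """H(s|x) from the last line of (B2), using base-2 logarithms.

    gamma: (k, K) array, gamma[s, x] = prior probability of community s given metadata x.
    x: (n,) integer metadata labels in 0..K-1.
    """
    g = gamma[:, x]                                    # g[s, u] = gamma_{s, x_u}
    terms = g * np.log2(np.where(g > 0, g, 1.0))       # treats 0 * log 0 as 0
    return float(-terms.sum() / len(x))

x = np.array([0, 0, 1, 1, 1])

# Metadata that determine the community exactly: no further information needed.
h_perfect = conditional_entropy_bits(np.eye(2), x)

# Metadata that say nothing about two equally likely communities: one bit per node.
h_useless = conditional_entropy_bits(np.full((2, 2), 0.5), x)
```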

The conditional entropy is equal to the amount of additional information one
would need, on top of the metadata themselves, in order to specify the
community membership of every node in the network. If the metadata are
perfectly correlated with the communities, so that knowing the metadata tells
us the community of every node, then the conditional entropy is zero.
Conversely, if the metadata are worthless, telling us nothing at all about
community membership, then the conditional entropy takes its maximum value,
equal to the total entropy of the community assignment
$H(\mathbf{s}) = -\sum_s P(s) \log P(s)$. Alternatively, if we want a measure
that increases (rather than decreases) with the amount of information the
metadata give us, we can subtract $H(\mathbf{s}|\mathbf{x})$ from
$H(\mathbf{s})$, which gives the (unnormalized) mutual information

\[
I(\mathbf{s};\mathbf{x}) = H(\mathbf{s}) - H(\mathbf{s}|\mathbf{x}). \tag{B3}
\]

This has a range from zero to $H(\mathbf{s})$, making it potentially hard to
interpret, so commonly one normalizes it, creating the normalized mutual
information. There are several different normalizations in use. As discussed by
McDaid et al., it is mathematically reasonable to normalize by the larger, the
smaller, or the mean of the entropies $H(\mathbf{s})$ and $H(\mathbf{x})$ of
the communities and metadata. Danon et al. [26] in their original work used the
mean, while Hric et al., in their work on lack of correlation between
communities and metadata, used the maximum. In the present case, however, we
contend that the best choice is the minimum.

Since the maximum value of the mutual information is $H(\mathbf{s})$, this sets
the scale on which it should be considered large or small. Thus one might
imagine the correct normalization would be achieved by simply dividing
$I(\mathbf{s};\mathbf{x})$ by $H(\mathbf{s})$, yielding a value that runs from
zero to one. This, however, would give a quantity that was asymmetric with
respect to $\mathbf{s}$ and $\mathbf{x}$: if the values of the two vectors were
reversed the value of the normalized measure would change. Mutual information,
by convention, is symmetric and we would prefer a symmetric normalization
scheme. Dividing by $\min[H(\mathbf{s}), H(\mathbf{x})]$ achieves this. In all
the examples we consider, the number of communities is less than the number of
metadata values, in some cases by a wide margin. Assuming the values of both to
be reasonably broadly distributed, this implies that the entropy
$H(\mathbf{s})$ of the communities will be smaller than that of the metadata
$H(\mathbf{x})$ and hence, normally,
$\min[H(\mathbf{s}), H(\mathbf{x})] = H(\mathbf{s})$. Thus if we define

\[
\mathrm{NMI} = \frac{I(\mathbf{s};\mathbf{x})}{\min[H(\mathbf{s}), H(\mathbf{x})]}, \tag{B4}
\]


we ensure that the normalized mutual information lies between zero and one,
that it has a symmetric definition with respect to $\mathbf{s}$ and
$\mathbf{x}$, and that it will achieve its maximum value of one when the
metadata perfectly predict the community membership. Other definitions,
normalized using the mean or maximum of the two entropies, satisfy the first
two of these three conditions but not the third, giving values smaller than one
by an unpredictable margin even when the metadata perfectly predict the
communities.

We use the definition (B4) in all the calculations presented in this paper.
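For readers who want to experiment with the normalization in (B4), the following Python sketch computes it for two label vectors. Note the assumption: it estimates all entropies from empirical frequencies of hard community labels, whereas the calculations in this paper use the learned priors $\gamma_{sx}$ as in (B2). The function names and test vectors are hypothetical.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical distribution of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def nmi_min(s, x):
    """Mutual information of s and x divided by min[H(s), H(x)], as in (B4)."""
    _, s_idx = np.unique(np.asarray(s), return_inverse=True)
    _, x_idx = np.unique(np.asarray(x), return_inverse=True)
    joint = s_idx * (x_idx.max() + 1) + x_idx          # one code per (s, x) pair
    h_s, h_x, h_joint = entropy_bits(s_idx), entropy_bits(x_idx), entropy_bits(joint)
    mutual_info = h_s + h_x - h_joint                  # equals H(s) - H(s|x), Eq. (B3)
    denom = min(h_s, h_x)
    return mutual_info / denom if denom > 0 else 0.0

# Four metadata values that perfectly predict two communities.
nmi_perfect = nmi_min([0, 0, 1, 1], [0, 1, 2, 3])

# Metadata that are independent of the communities.
nmi_independent = nmi_min([0, 0, 1, 1], [0, 1, 0, 1])
```

In the first case the metadata have higher entropy than the communities, so a mean or maximum normalization would give a value below one even though the prediction is perfect, while the minimum normalization gives exactly one.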

Appendix C: Further examples

In this appendix we present a number of additional applications of our methods
as well as some additional details on examples described in the main text.
Summary statistics on all the networks studied are given in Table I.


FIG. 4: Learned prior probability of community membership 
for two five-way divisions of the Facebook friendship network 
of Harvard students described in the text. The horizontal axis 
is (top) year of graduation and (bottom) dorm, and the colors 
represent the prior probabilities of membership in each of the 
communities. 


1. Facebook friendship network 

The FB100 data set of Traud et al. is a set of friendship networks among
college students at US universities compiled from friend relations on the
social networking website Facebook. The networks date from the early days of
Facebook when its services were available only to college students and each
university formed a separate and unconnected subgraph in the larger network.
The nodes in these networks represent the students, the edges represent friend
relations on Facebook, and in addition to the network structure there are
metadata of several types, including gender, college year (i.e., year of
college graduation), major (i.e., principal subject of study, if known), and a
numerical code indicating which dorm they lived in.

The primary divisions in these networks appear to be by age, or more
specifically by college year. For instance, we have looked in some detail at
the data for Harvard University, which was the birthplace of Facebook and its
biggest institutional participant at the time the data were gathered, with
15 126 students in the network, spanning college years 2003 to 2009. There are
also a small number of Harvard alumni (i.e., former students) in the data set,
primarily those recently graduated (graduation years 2000-2002). The top panel
in Fig. 4 shows results from a five-way division of the network using our
algorithm with year as metadata. Year, for the purposes of this calculation,
was treated as an unordered variable, placing no constraints on the value of
the prior probabilities of community membership for adjacent years. One could
have treated it as an ordered variable, which would have constrained adjacent
years to have similar priors, but we did not do that here. Nonetheless, as we
will see, the algorithm finds communities in which adjacent years tend to be
grouped together.

This network provides a good example of the usefulness of the learned priors in
shedding light on the structure of the network. The figure shows a
visualization of the priors as a function of year, with the colors showing the
relative probability of belonging to each of the communities. Each of the bars
in the plot has the same height of 1 since the prior probabilities are required
to sum to 1, while the balance of colors shows the distribution over
communities. Examination of the top panel in the figure shows clearly a
division of the network along age lines. Two groups, in orange and yellow at
the right of the plot, correspond to the most recent two years of students at
the time of the study (graduation years 2008 and 2009) and the next, in red,
accounts for the two years before that (2006 and 2007). The purple community
corresponds to the next three years, 2003-2005, while the fifth group, shown in
blue, corresponds to the alumni. Finally, students for whom year was not
recorded are shown in the column marked “None,” which is a mixture of all five
groups.

TABLE I: Summaries of real-world network examples and their node metadata
variables. For each we report the normalized mutual information (NMI) between
the metadata and the communities found without metadata (“blind”) and found
using the methods described in this paper.

network                      domain         nodes n   edges m     metadata                             NMI, blind    NMI, with metadata
School friendships           social         795       2072        grade                                0.105-0.384   0.881
                                                                  ethnicity                            0.120-0.239   0.820
                                                                  gender                               0.000-0.010   0.003
Predator-prey interactions   ecological     488       15 880      species body mass, ecological role   0.348-0.443   0.595
Internet graph               technological  46 676    262 953     country                              0.396-0.626   0.870
Facebook friendships         social         15 126    1 649 234   graduation year                      0.573-0.641   0.668
                                                                  dormitory                            0.074-0.224   0.255
Malaria gene recombinations  biological     297       2 684       Cys-PoLV labels                      0.077-0.675   0.596

These results align well with the original analysis of the same data by Traud
et al., who performed a traditional community division of the network and then
carried out post hoc statistical tests to measure correlations between
communities and metadata. They found strong correlations with college year
metadata, in agreement with our results. With the benefit of hindsight the
results may appear unsurprising (anyone who has been to college knows that a
large number of your friends are in the same year as you), but one could
certainly formulate competing hypotheses. One alternative that Traud et al.
considered was that friendship might be influenced by where students live, with
students living in the same dormitory more likely to be friends, regardless of
what year they are in. Traud et al. found that there was some evidence for this
hypothesis, but that the effect was weaker than that for age, and our analysis
confirms this. The bottom panel in Fig. 4 shows a plot of the priors for a
division with dorm as the metadata variable and there is a clear correlation
between dorm and community membership, but it is not as clean as in the case of
age. There appear to be two groups that align strongly with particular sets of
dorms (colored red and purple in the figure) while the rest of the dorms are a
mix of different communities (the region in the middle of the figure). The
impression that the community structure is more closely aligned with graduation
year than with dormitory is also borne out by the normalized mutual information
values for the two divisions. For the case of graduation year the NMI is 0.668;
for dormitory it is 0.255.

2. Malaria gene recombination network 

Malaria, which is caused by the parasite P. falciparum, is endemic in tropical
regions and is responsible for roughly a million deaths annually, mostly
children in sub-Saharan Africa. During infection, parasites evade the host
immune system and prolong the infection by repeatedly changing a protein
camouflage displayed on the surface of an infected red blood cell. To enable
this behavior, each parasite has a repertoire of roughly 60 immunologically
distinct proteins, each of which is encoded by a var gene in the parasite’s
genome. These genes undergo frequent recombination, producing novel proteins by
shuffling and splicing substrings from existing var genes.

The process of recombination induces a natural bipartite network with two types
of nodes, var genes on the one hand and their constituent substrings on the
other, where each gene node is connected by an edge to every substring it
contains [13]. Recombination in these genes occurs mainly within a number of
distinct highly variable regions (HVRs) and each HVR represents a distinct set
of edges among the same nodes. Here, we focus on the one-mode gene-gene
projections of the HVR 5 and HVR 6 subnetworks, which have previously been
analyzed using community detection methods without metadata [13]. Each of these
one-mode networks consists of 297 genes.

We analyze these networks using the methods described in this paper. As
metadata, we use the Cys labels derived from the HVR 6 sequence and the
Cys-PoLV (CP) labels derived from the sequences adjacent to HVRs 5 and 6 [13].
Both types of labels depend only on the sequences’ characteristics: Cys
indicates the number of cysteines the HVR 6 sequence contains (2 or 4) while CP
subdivides the Cys classifications into 6 groups depending on particular
sequence motifs. Thus, each node has two metadata values, a Cys label and a CP
label. The Cys labels are biologically important because cysteine counts have
been implicated in severe disease phenotypes.

In our calculations we use the six CP labels as metadata for a 2-way community
division of the network and then evaluate the degree to which the inferred
communities correlate with the Cys metadata. Figure 5 shows the results for the
HVR 6 network with and without the CP labels as metadata. Without metadata, the
Cys labels are mixed across the inferred groups (Fig. 5a), but with metadata we
obtain a nearly perfect partition (Fig. 5b). This indicates that the CP label
correlates well with the network’s community structure, a fact that was
obscured in the analysis without metadata. Furthermore, the inferred
communities correlate strongly with the coarser Cys labels, which were not
shown to the method: observing that a gene has two cysteines is highly
predictive (96% probability) of that gene being in one group, while having four
cysteines is modestly predictive (67% probability) of being in the other group.
Thus the method has discovered by itself that the motif sequences that define
the CP labels, along with their corresponding network communities, correlate
with cysteine counts and their associated severe disease phenotypes.

FIG. 5: Inferred communities, (a) without metadata and (b) with metadata, for
the HVR 6 gene recombination network of the human malaria parasite
P. falciparum, where metadata values are the CP labels for the genes. Nodes are
colored by their biologically relevant Cys label.

The communities in the HVR 6 network represent 
highly non-random patterns of recombination, which are 
thought to indicate functional constraints on protein 
structure. Previous work has conjectured that common 
constraints on recombination span distinct HVRs [sj. 
We can test this hypothesis using the methods described 
in this paper. There is no reason a priori to expect that 
the community structure of HVR 6 should correlate with 
that of HVR 5 because the Cys and CP labels are derived 
from outside the HVR 5 sequences—Cys labels reflect 
cysteine counts in HVR 6 while CP labels subdivide Cys 
labels based on sequence motifs adjacent to, but outside 



FIG. 6: Inferred communities for the HVR 5 gene recombination
network of the human malaria parasite P. falciparum, (a) without
metadata and (b) with metadata, where metadata values are
the CP labels for the HVR 6 network.


of, HVR 5. Applying our methods to HVR 5 without
any metadata (Fig. 6a), we find mixing of the HVR 6
Cys labels across the HVR 5 communities. By contrast,
using the CP labels as metadata for the HVR 5 network,
our method finds a much cleaner partition (Fig. 6b),
indicating that indeed the HVR 6 Cys labels correlate with
the community structure of HVR 5.


3. Weddell Sea food web 

As discussed in the main text, the Weddell Sea food
web provides an example of the “ordered” metadata type
in the body mass of species. A three-way community division
of the network with the log of species’ average
body mass as metadata produces the division shown in
Fig. 4 of the paper. The prior probabilities as functions
of body mass are of interest in their own right. They are
shown in Fig. 7. Although, as described in Section lA 31 of
the paper, the log mass is rescaled in our calculations to
the range [0,1], the horizontal axis in the figure is calibrated
to read in terms of the original mass in grams, so
the prior probabilities of belonging to each of the three
communities can be simply read from the figure. The
blue, green, and red curves correspond respectively to
the communities labeled 1, 2, and 3 in Fig. 4. Thus a

FIG. 7: Learned priors, as a function of body mass, for the
three-community division of the Weddell Sea network shown
in Fig. 4 of the main paper.


species with a low mean mass of 10“^^ g has about an
80% probability of being in community 1, a 20% probability
of being in community 2, and virtually no chance of
being in community 3. Conversely, a species with mean
body mass of 10^8 g (which could only be a whale) has
about a 90% chance of being in community 3, 10% of
being in community 2, and almost no chance of being in
community 1.
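
The rescaling of log mass to the range [0,1], and the calibration of the figure axis back to grams, can be illustrated with a short sketch. It assumes a simple min-max rescaling of the base-10 logarithm, which is one natural reading of the description above but is an assumption here; the function names and the toy masses are hypothetical and are not the Weddell Sea data.

```python
import math

def rescale_log_mass(masses_g):
    """Map masses in grams to [0, 1] by min-max rescaling of log10(mass).

    Assumption: min-max rescaling; the text only states that the log mass
    is rescaled to the range [0, 1].
    """
    logs = [math.log10(m) for m in masses_g]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        raise ValueError("need at least two distinct masses")
    return [(x - lo) / (hi - lo) for x in logs], lo, hi

def to_grams(x, lo, hi):
    """Invert the rescaling: map x in [0, 1] back to a mass in grams."""
    return 10.0 ** (lo + x * (hi - lo))

# Hypothetical masses in grams (illustrative values only).
toy_masses = [1e-6, 1e-2, 1.0, 1e4, 1e8]
scaled, lo, hi = rescale_log_mass(toy_masses)
```

With this convention the priors would be computed on the rescaled variable, while `to_grams` supplies the axis labels, so probabilities can be read off directly against mass in grams.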